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Abstract 

In this paper, parallel CKlog n) algorithms for computation of rigid 

multibody dynamics are developed. These parallel algorithms are derived by 

parallelization of new 0(n) algorithms for the problem. The underlying feature 

of these 0(n) algorithms is a drastically different strategy for decomposition 

of interbody force which leads to a new factorization of the mass matrix (^<). 

Specifically, it is shown that a factorization of the inverse of the mass 

-1 * “1 

matrix in the form of the Schur Complement is derived as M = E - $ A SB, 
wherein matrices C, A, and B are block tridiagonal matrices. The new 0(n) 
algorithm is then derived as a recursive implementation of this factorization 
of M~ l . For the closed-chain systems, similar factorizations and 0(n) 
algorithms for computation of Operational Space Mass Matrix A and its inverse 
A -1 are also derived. It is shown that these 0(n) algorithms are strictly 
parallel, that is, they are less efficient than other algorithms for serial 
computation of the problem. But, to our knowledge, they are the only known 
algorithms that can be parallelized and that lead to both time- and processor- 
optimal parallel algorithms for the problem, i.e. , parallel O(log n) 
algorithms with 0(n) processors. The developed parallel algorithms, in 
addition to their theoretical significance, are also practical from an 
implementation point of view due to their simple architectural requirements. 
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Total number of Degrees-OF-Freedom (DOF) of the system 

Position vector from point 0 to point 0 , P = P 

j i i+i, i i 

Mass of body i 

First and Second Moment of mass of body i about point 0^ 
Spatial Inertia of body i about point 0^, 


I = I = 

i, l l 


^ h , 


m U 

i 


e(R 


,6x6 


Symmetric Positive Definite 
Vector of joint positions 


(* denotes the transpose) 
(SPD) mass matrix 


Vector of joint velocities 


Vector of joint accelerations 

Vector of applied (control) joint forces/torques 
Angular and linear acceleration of body i (frame i+1) 

Linear velocity and acceleration of body i (point 0^ ) 

Force and moment of interaction between body i-1 and body i 


0) 


V = 

i 


F = 

i 


„ _6xn 

H dR i 
1 


dR 


6x1 


i J 


i J 


dR 


6x1 


Spatial acceleration of body i 

Spatial force of interaction between body i-1 and body i 

Spatial axis (map matrix) of joint i 

Table I. Notation 



Figure 1. Body, Frames, and Position Vectors 
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I. Introduction 

The multibody dynamics problem concerns the determination of the motion of 
the mechanical system, resulting from the application of a set of control 
forces. In the context of robotics, the dynamic simulation problem is better 
known as the forward dynamics problem. 

From a computational point of view, the multibody dynamics problem can be 
stated as the solution of a linear system as 

MQ - J - b(0,Q) = y p or (1) 

Q = M (2) 

T 

where the vector b(0,Q) represents the contribution of nonlinear terms and can 
be computed by using the recursive Newton-Euler (N-E) algorithm [3] by setting 
the joint accelerations to zero. Hence, in Eqs. (l)-(2), ^ = col{F }dR nxl 

represents the acceleration-dependent component of the control force. 

The developed serial algorithms for the problem can be classified as the 

0(n 3 ) algorithm [4], the 0(n 2 ) algorithm [5], and the 0( n) algorithms [6-13]. 

See also [14] for more complete references as well as an extensive analysis 

and comparison of these algorithms. In addition to these algorithms, which are 

based on rather direct methods, there is also another class of indirect (or, 

2 

iterative) algorithms for solution of Eq. (1) which include the 0(n ) 
conjugate gradient algorithms [4,15,16]. 

It seems that the development of serial algorithms for the problem has 
reached a certain level of maturity. Asymptotically, the 0(n) algorithms 
represent the fastest possible serial method for the problem, since, given the 
n-component input (vector of control force), the evaluation of the n-component 
output (joint accelerations) requires at least 0(n) distinct steps in the 
computation. Hence, any further improvement in computational efficiency of the 
0(n) algorithms can only be achieved by reducing the coefficients (see for 
example [10,14] wherein this reduction has been achieved by avoiding explicit 
computation of the term b(8,Q}). 

The relationship among the different direct algorithms is also well 
understood, and two fundamental results have been established [14]. The first 
is that, at a conceptual level, the 0(n) algorithms can be essentially 
considered as a procedure for recursive factorization and inversion of mass 
matrix, i.e., recursive computation of M [9,10,11,14]. The second result 
is that, at a computational level, the 0(n) algorithms lead to the computation 
of the articulated-body inertia [7]. The reader is referred to [14] for an 
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extensive analysis of commonalities in the computation of the 0(n) algorithms. 
It should be emphasized that our analysis of the parallel computation 
efficiency of different algorithms relies on these two results. 

Despite the significant improvement in the efficiency of serial algorithms, 
even the fastest algorithm is still far from providing real-time or faster- 
than-real-t ime simulation capability. With the maturity of serial algorithms, 
any further significant improvement in computational efficiency can be 
achieved only through exploitation of parallelism. This is further motivated 
by advances in VLSI technology that have made parallel computation a practical 
and low-cost alternative for achieving significant computational efficiency. 
However, unlike serial computation, there are few reports on the development 
of parallel algorithms for the problem. 

The development of efficient parallel algorithms for multibody dynamics is 
a rather challenging problem. It represents an interesting example for which 
the analysis of the efficiency of a given algorithm for parallel computation 
Is far different and more complex than that for serial computation. In fact, 
our previous analysis [1,2,17] and the results of this paper clearly show that 
those algorithms that are less efficient (in terms of either asymptotic 
complexity or number of operations) for serial computation provide a higher 
degree of parallelism and hence are more efficient for parallel computation. 

A preliminary investigation of parallelism in the computation of forward 
dynamics, analyzing the efficiency of existing algorithms for parallel 
computation, is reported in [2]. The main result of this investigation was 

3 

that the 0(n ) algorithms provide the highest degree of parallelism and are 
the most efficient for parallel computation. Specif ically, it was shown that 

1. Theoretically, the time lower bound of 0(log 2 n) can be achieved by 

3 3 

parallelization of the 0(n ) algorithms by using 0(n ) processors. 

2. Practically, the best parallel algorithm for the problem is of 0(n) which 
results from parallelization of the 0(n ) algorithms on a two-dimensional 
array of 0(n 2 ) processors. This parallel algorithm, although of 0(n), achieves 
a significant speedup over the best serial 0(n) algorithms by reducing the 
coefficient of the n-dependent term on polynomial complexity by more than two 

3 

orders-of -magnitude. Different approaches for parallelization of the 0(n ) 
algorithms have also been proposed in [18,19], 

The analysis in [1] also led to two additional important conclusions. The 
first was that, if indeed there can be a parallel algorithm achieving the time 
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lower bound of CKlog n) with an optimal number of 0(n) processors, then this 
parallel algorithm can only be derived by parallelization of an 0(n) serial 
algorithm. However, the analysis in [1] showed that the parallelism in the 
existing 0(n) algorithms was bounded, that is, at best only a constant speedup 
in the computation can be achieved, leading to the parallel 0(n) algorithms. 
More specifically, it was shown that the recurrence for computation of the 
articulated-body inertia is strictly serial and cannot be parallelized (see 
Sec. II. D). Hence, the second conclusion in [1] was that if the forward 
dynamics problem is to have the time lower bound of O(log n) for its 
computation, it can only result from a totally different serial 0(n) 
algorithm. Such an algorithm can only be derived by a global reformulation of 
the problem and not an algebraic transformation In the computation of existing 
0(n) algorithms. 

Physically, a given algorithm for multibody dynamics caii be classified 
based on its force decomposition strategy. Mathematically, the algorithm can 
be classified based on the resulting factorization of the mass matrix which 
corresponds to the specific force decomposition (see Sec. II. B and C). A new 
algorithm based on a global reformulation of the problem is, then, the one 
that starts with a different and new force decomposition strategy and results 
in a new factorization of mass matrix. 

Interestingly, a recently developed 0(n) algorithm in [21-24] for a single 
serial chain represents such a global reformulation of the problem. It differs 
from the existing 0(n) algorithms in the sense that it is based on a different 
strategy for force decomposition (see Sec. III). We will show that this 
strategy leads to a new and completely different factorization of M. . This 
factorization, in turn, results in a new 0(n) algorithm for the problem which 
is strictly efficient for parallel computation, that is, it is less efficient 
than other 0(n) algorithms for serial computation but, as will be shown, it 
can be parallelized to achieve the time lower bound of O(log n) with 0(n) 
processors. We show that this factorization of X' 1 also directly leads to new 
factorizations and 0(n) algorithms for closed-chain systems. Again, these new 
0(n) algorithms for closed-chain systems can be parallelized to derive both 
time- and processor-optimal parallel algorithms for the problem, i.e., 

O(log n) parallel algorithms with 0(n) processors. Furthermore, the new 
factorizations for both open- and closed-chain systems can be uniformly 
described in terms of the Schur Complement and provide different and deeper 
physical insights into the problem. 
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This paper is organized as follows. In Sec. II, the 0(n) algorithms, i.e, 
the Articulated-Body Inertia algorithm and recursive factorization and 
inversion of mass matrix, are briefly reviewed. In Sec. Ill, the Constraint 
Force algorithm and the new factorization of mass matrix are derived. In 
Sec. IV, new factorizations and 0(n) algorithms for closed-chain systems are 
presented. In Sec. V, parallel O(log n) algorithms for both open- and closed- 
chain systems are briefly presented. Finally, some concluding remarks are made 
in Sec. VI. 


II. The 0(n) Algorithms: Recursive Factorization and Inversion of Mass Matrix 
A. Notation and Preliminaries 


In our discussion of the 0(n) algorithms, a set of spatial notations is 
used which, though slightly different from those in [8-11, 21-24], allows a 
clear understanding and comparison of the algorithms (see also Table I and 
Fig. 1). For the sake of clarity, the spatial quantities are shown with 
upper-case Italic letters. Here, only joints with one revolute DOF are 
considered. However, all the results can be extended to the systems with 
joints having different and more DOF’s. 


With any vector V, a tensor V can be associated whose representation in 
any frame is a skew symmetric matrix as 


V = 


0 

V 

(z) 

-V 


-V 

(z) 

0 

V 


-V 

(x) 

0 


(y) (x) J 

where V , V , and V are the components of V in the considered frame. 
(X)’ (y)’ (z) ^ 

^ ~ -V, 

The tensor V has the properties that V = -V and V V = V xV . A matrix V 

^ ^ 12 12 

associated to the vector V is defined as 


■ u 

V ■ 


’ U 

0 ' 



and V* = 



_ 0 

u 


-V 

U a 


where here (as well as through the rest of the paper) U and 0 stand for unit 
and zero matrices of appropriate size. The spatial forces acting on two 
rigidly connected points A and B are related as 


B A , B A 

where P denotes the position vector from B to A. 
A, B 

velocities of point A are zero then 


= ^ k , B 5 


If the linear and angular 
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The matrix P has the properties P P = P and [P ) 1 - P - 

A, B A j B L A | L d^a 

In derivation of equations of motion, it is assumed that the nonlinear term 
b(0, Q) is explicitly computed by using the recursive N-E algorithm. For both 
the articulated-body algorithm (as shown in [14]) and the new algorithm, this 
explicit computation can be avoided. However, this does not affect the 
efficiency of the algorithms for parallel computation. In fact, as for the 
0(n 3 ) and 0(n 2 ) algorithms [16,17], the explicit computation of b(0,Q) 
provides additional parallelism which can be exploited to further increase the 
speedup in the computation. 


By computing the term b(0,Q) and subtracting it from (7 (Eq. 1), i.e., by 
explicitly computing 2^, the multibody system can be assumed to be a system at 
rest which upon the application of the control force ? accelerates in space. 
The equation of motion for body i, as a single rigid body, is given as 


F = I V 

i i l 


and as an interconnected member of the serial chain is given as (Fig. 1) 


V = P V + H Q 

l l-l l-l l l 

F = I V + P F 

l ii ll+i 


(3) 

(4) 


Eqs. ( 3 ) — (4 ) represent the simplified N-E algorithm (with nonlinear terms 
excluded) for the serial chain. 


Equation (4) represents the interbody force-decomposition strategy of the 
N-E formulation. As shown in [9-11], this force-decomposition strategy leads 
to a specific factorization of M. To see this, let us rewrite Eqs. (3)-(4) as 

V - P* V = H Q (5) 

l l-i l-i i i 

F - P F = I V , (6) 

i ii+l il 

and define 

H = diag{// i >dR 6nxn 

3 = diag{ 7 ^ }e(R 6nx6n 

V = coH^}d? 6nxl 
? = col{F i >dR 6nxl 
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p = 


u 

-p 

r 

0 

0 


n, n-1 


U 

-P 

r 
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n-l , n-2 


u 
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0 0 
U 

P U 

n, n-l 
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n, n-2 
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6nx6n 


n-l , n-2 
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elR 


,6nx6n 


n-l ,1 n, 1 

Eqs. ( 5 ) — C 6 ) can now be rewritten in a global form as 

P*V = HQ (7) 

P2 = 3V ( 8 ) 

A factorization of mass matrix, associated with the force decomposition in 
Eq. (4), can now be derived as 

£ = (9) 

T 

which, in comparison with Eq. (1), represents a factorization of M as 

x = (10) 

-1 * - 1 

Although the matrices P , ?, and {P ) are square and have trivial inverses, 


the matrices W and H are not square. This prevents the computation of M 
from the above factorization. 


-l 


B. The Articulated-Body Inertia (A-BI) Algorithm 

The Articulated-Body Inertia (A-BI) algorithm is based on a decomposition 
of F as [8] 

f = i k v + r A (ID 

i 11 1 

where I A is the articulated-body inertia of body i. The force T * is a function 

of I k and for j = n to i + 1. If I* (and hence 7*), for j = n to i, is 
computed, then the projection of Eq. (6) along the joint axis i leads to a new 
equation with V as the only unknown 

F = HF = H I k V + HT k (12) 

T1 I i ill i i 
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Starting from 1-1, the joint accelerations can then be recursively computed 
from Eq. (12). This clearly explains the motivation behind the specific force 
decomposition in Eq. (11), which, unlike the one in Eq. (4), leads to the 
solution for joint accelerations. 

The computational steps of the A-BI algorithm are given as [8] 

For i = n to 1 


I A = I + P f/ A - i k H ( W* 1 A H )'V J* Ip* 

i J l + l i + 1 1 + 1 1 + 1 l+l 1 + 1 i+1 1 + lJ 1 

i A = i 

n n 

(13) 

r* = p ( r A - I A H (77* i A H ) -1 (F - 77* 7* )1 

1 l ^ i + 1 1 + 11+1 1+1 1 + 1 1+1 Tl+1 i + 1 i+1 J 

o 

II 

■< c 

(14) 

For i = 1 to n 



Q = (F - 77*/ A P* V - 77*7*) (77 * 7* 77 ) -1 

Ti i 1 i-1 i-1 i i i + 1 1 + 1 i+1 

o 

II 

o 

(15) 

V = P* V + H Q 

I 1-1 i-1 i i 


(16) 

C. Recursive Factorization and Inversion of Mass Matrix 




In [9-11], starting with the factorization in Eq. (10), an alternate 
factorization of the mass matrix in terms of square factors is derived as 


M = (U + kV _ 1 k) Z)(U + ttV 1 *)* ( 1? ) 

e ± U - T dR 6nx6n 

P 

3 k ± diag{ J*}e(R 6nx6n 

© = diaglD^} = diag{// ( 1*//^ }c(R nxn (18) 

S = diaglG^ = /w©' 1 elR 6nxn (19) 

X = & *§ c!R 6nxn (2°^ 

P 


The nxn matrices (U + XT ’X), D, and (U + X T *70 are, respectively, lower 
triangular, diagonal, and upper triangular. The factorization in Eq. (17) 
represents the LDL* factorization of the SPD mass matrix (which is unique) in 
an analytical form. Furthermore, due to the positive definiteness of M, the 
matrix T> is nonsingular, that is, * 0 (this is also proved in [8]). 

In [9-11] it is shown that the inverse of the factor (U + X T l X) can be 
derived in an analytical form as 

(u + wV" 1 *)' 1 = (u - hVk) (2D 

where * = }clR 6nx6n is a lower triangular matrix with 

i > J 
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0 = U and 0 = P IU - G H*) eR 6 * 6 , i = n to 1 and j = i-1 to 1 (22) 

I,! *,j II 

From Eqs. (17) and (21), a factorization of M~ l is derived as 

M~ l = (U - - W^X) (23) 

The significant contribution of the work in [9-11] is to exploit further 
structure of the mass matrix (in addition to the symmetry and positive- 
definiteness) and explicitly obtain the above factorization of M~ l . It also 
demonstrates that the force decomposition in Eq. (11) corresponds to this 
factorization of M l . If the articulated-body inertia is computed from 

Eq. (13) and the terms K and ^ are computed according to Eqs. (18)-(20) and 

(22), then from Eqs. (2) and (23) the solution for Q is obtained as 

Q = (U - JM'XjV^U - JM'X)^ (24) 

In [9-11,14] it is shown that the recursive implementation of Eq. (24) results 
in an 0(n) algorithm whose computational steps (with some minor modifications) 
correspond to those in Eqs. (13)-(16). 


D. Parallelism in the 0(n) Algorithms 

The main bottleneck in parallel computation of the A-BI algorithm is the 
computation of I* from Eq. (13), which can be represented, at an abstract 
level, as the solution of a set of first-order nonlinear recurrences 


x = c + 0 (x )/<p (x ) = c + <p(x ) 

I i 2 l+l 1 i+1 i i+1 

where is a constant, 0 ^ and 0^ are polynomials of first and second degree, 
and deg 0 = Max (deg 0^, deg 0 ) =2. It is well known that the parallelism in 
computation of nonlinear recurrences of the above form and with deg 0>1 is 
bounded [25,26], that is, regardless of the number of processors used, their 
computation can be speeded up only by a constant factor. This is due to the 
fact that the data dependency in nonlinear recurrences and particularly those 
containing division is stronger than in linear recurrences [26]. Hence, the 
parallelism in the 0(n) articulated-body based algorithms is bounded and their 
parallelization leads to parallel 0{ n) algorithms which are faster than the 
serial algorithms only by a constant factor. Note that a rather simple model 
was used to describe the nonlinear recurrences for computation of the 
articulated-body inertia, while they are far more complex than those usually 
studied in the literature, e.g., in [25,26], 

However, the computations in Eqs. (14)— (16) can be fully parallelized since 
they can be transformed into a set of first-order linear recurrences (here, 
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due to the lack of space, we do not discuss these transformations). This 
clearly indicates that the main obstacle in parallelization of the 0(n) 
articulated-body based algorithms is the computation of the articulated-body 
inertia. It should also be mentioned that the 0(n) algorithm in [7], which was 
originally developed for serial chains with 3-DOF spherical joints, involves 
nonlinear recurrences which are even more complex than those for computation 
of articulated-body inertia. 

III. The Constraint Force Algorithm 
A. Basic Force Decomposition and Algorithm 

The algorithm in [21-24] is based on a decomposition of interbody force as 

F = H F + V F (25) 

I i Tl I Si 

where F is the constraint force and W is the orthogonal complement of H 
si l 1 

which is defined [27,28] by 

H H* + V V* = U (26) 

ii ii 

The matrix H ^ is a projection matrix and hence 

H*H = U (27) 

i i 

It then follows that the matrix W ^ is also a projection matrix and that [27] 

H W = W*H = 0 and V*W = U (28) 

i i i l i i 

For a joint i with ni DOF’s (ni<6), it follows that tf^clR 6 *" 1 and V^elR 6 ** 6 nl . 
For a more detailed discussion on these projection matrices see [27,28]. 

The decomposition in Eq. (25) seems to be more natural (and perhaps more 
physically intuitive) than those in Eqs. (4) and (11) since it expresses the 
interbody force in terms of two physically more basic components: the control 
(or, working) force and the constraint (or, nonworking) force. In fact, as 
stated in [21], the basic idea of the algorithm was first presented in [29] 
for a system of particles, and later in [30] it was extended to rigid body 
systems. However, both works were concerned with the constraint stabilization 
problem and the algorithm had not been used as an alternative procedure for 
the dynamic simulation problem. Also, the independent derivation of the 
algorithm in [21-24] was mainly motivated by its suitability for parallel 
iterative solution of the dynamic simulation problem. 

It is not surprising that the algorithm has not been considered as a viable 
alternative for direct serial and parallel solution of the multibody dynamics 
problem. The decomposition in Eq. (25) naturally leads to the explicit 
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computation of the constraint (and interbody) forces, which has also motivated 
the designation of the algorithm as the Constraint Force (CF) algorithm. In 
fact, researchers have always argued that since the constraint forces are 
nonworking forces, their computation is not needed and leads to computational 
inefficiency. Consequently, the elimination of the constraint forces from the 
equations of motion has always been considered as a necessary first step in 
the derivation of efficient algorithms. 

Here, for the sake of clarity and self-completeness, we first redrive the 
algorithm as presented in [21-24]. We then show that the force decomposi tion 
in Eq. (25) leads to a new factorization of M 1 . This allows a better under- 
standing of the algorithm as well as its comparison with other algorithms, 
particularly the recursive factorization and inversion of mass matrix. 

Equation (25) can be written in global form as 
? = M3 + W3 (29) 

T S 

with W = diag{W^ >dR 6nx5n and ^ = col{F sj } e[R 5nxl . For global matrices H and 


W t Eqs. (26)-(28) are written as 

HH + WW* = U, HW = W*H = 0, and HH = W* W = U (30) 

From Eqs. (7)-(8) and (30), it follows that 

V = (31) 

W*?*V = W*MQ = 0 (32) 

jrVj'V? = 0 (33) 

and substituting Eq. (29) into Eq. (33) yields 

wWVof? + w$ ) = o =* irVV'm = -wWVw? , or ( 34 ) 

T S 5 T 

= -m (35) 

S T 


where A = W* T 3 l ?W elR 5nx5n and B = wVWtf cR 5nxn . The global constraint 
force, ^ , is computed as the solution of the linear system in Eq. (35), where 
d is a symmetric, positive-definite, block tridiagonal matrix. The global 
interbody force (^) and acceleration ( V ) are then computed from Eqs. (29) and 
(31). Finally, the joint accelerations are computed from Eqs. (7) and (30) as 

H*HQ = KTV *> Q = JfW (36) 

The solution of the linear system in Eq. (35) represents the most 
computationally intensive part of the algorithm. In [21-24], exploiting the 
structure of matrix d (i.e., symmetry, positive-definiteness, and block 
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tridiagonal form), a set of iterative algorithms for solution of Eq. (35) is 
developed. It is shown that these iterative algorithms can be efficiently 
parallelized and implemented on a simple architecture with n processors while 
the rest of the computation is performed serially. Although the computational 
complexity of the developed parallel iterative algorithms is still of 0(n), 
the extensive simulation in [21-24] has shown that the algorithms achieve 
speedup over the serial A-BI algorithm. 


B. A New Factorization of M. 


Here, we extend the work in [21-24] by first deriving an operator form of 
the algorithm and then showing that the force decomposition in Eq. (25) indeed 
leads to a new and interesting factorization of M . From Eqs. (29) and (34) 
the global interbody force can be computed as 


? = - W (wVWw ) WWwj? T (37) 

From Eqs. (8) and (37), V is computed as 

v = - w (wWVw )"WWwj? T 

and finally from Eqs. (36) and (38), Q can be computed as 
q = - KT*3~ l TVf {W*P*S' i 7 > W 

which represents a compact operator form of the algorithm. In comparison with 
Eq. (2), an operator form of J M’ 1 , in terms of its decomposition into a set of 
simpler operators, is given as 

(40) 


(38) 


(39) 


jh -1 = ht*5~ x th - (wVWif ) 


C. Alternate Approach for Factorization of M based on the Schur Complement 

The operator form of X _1 given by Eq. (40) represents an interesting 
mathematical construct. To see r this, let 

e = hWVh dR nxn 

M~ l is now written as 

m ~ 1 = e - sbV 1 ® (4i) 

Consider a matrix i t defined as 



$ 

e 


cE 


6nx6n 


(42) 
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then C - B*A l< B is the Schur Complement of A in H [31] and is designated as 
(j t/A). The structure of matrix % motivates a different and simpler approach 
for derivation of the algorithm. Assume that the spatial acceleration of body 
i is written in terms of two components: one generated by acceleration of 
DOF’s (Q^), and the other generated by acceleration of Degrees-Of-Constraint 

• 5x1 

(DOC’s), denoted as o^elR (of course, by definition <r ^ =0). Then rewrite 


Eqs. (5) and (7) as 

V - P* V = H Q + V <r (43) 

i 1-1 i~l 11 11 

TV = HQ + W± (44) 


with S = col{(r}e(R 5nxl . From Eqs. (8), (29), and (44), it then follows that 

?>W W2 + T £~ X TW = KQ + wi (45) 

S T 

* 

Z and Q can be obtained by multiplying both sides of Eq. (45) first by Vf and 


* 

then by H as 

wVWw? + JrVWx? = Z (46) 

S T 

HVf'm + wVWw? = Q, or (47) 

S T 

aj + m - i (48) 

S T 

SB*? + £? = Q (49) 

S T 


Eq. (39) is then obtained by setting Z = 0 and using the Guassian elimination 

for solving for ? . If the vectors of total acceleration (a ) and total force 
T C 

(? ) are defined as 
G 


A 

a = 

' z 

^6nxl , f~r A 

el R and ? = 

? I 

s 

G 

_ Q . 

T 

SF 

T j 


e(R 


6nxl 


then Eqs. (48) -(49) can be written as 


£2 = a 


(50) 


C G 

The matrix £ can be interpreted as the inverse of the augmented mass matrix; 
it relates the total force and acceleration. M 1 is then the Schur Complement 
of d in £, that is, 

M~ X = (i e/d) (51) 


It should be mentioned that an even simpler physical interpretation of the 
algorithm along with an alternate direct approach for derivation of Eqs. (48)- 
(49) can be given by noting the physical interpretation of the operators H t T t 
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, W , etc. and the matrices A, B, and £ [32]. 


It should also be pointed out that by using the matrix identity 
(C - XDY)' 1 = C" 1 + C _1 X(D _1 - YC -1 X)YC -1 


(52) 


in Eqs. ( 40 ) — ( 4 1 ) an operator expression of M can be obtained as 

m = e" 1 + e'Vu - strVr^e -1 


= (kVVVk) 

(kVj'Vki 


+ (wVrVtfr^HVrVin^VVVin 

(kVj'Vitj'Wj'Vk) (wVVwo ' 1 


(wVWjo 


(53) 


2 

An 0(n ) algorithm for computation of the mass matrix can be derived based on 
the above expression of M (which is asymptotically as fast as the best serial 
algorithms). However, this operator expression of M is significantly more 
complex and its associated algorithm is less efficient than other operator 
expressions and their associated algorithms in Eqs. (10) and (17). 


D. Serial Computation of the Constraint Force Algorithm 

An efficient serial implementation of the 0(n) CF algorithm is based on 
rewriting Eq. (39) as 

Q = (54) 

Here, the key to achieving greater efficiency for both serial and parallel 
computation is to simply perform, as much as possible, the matrix-vector 

multiplication instead of matrix-matrix multiplication. In this regard, the 

* 

matrices £, SB , and £ do not need to be computed explicitly and only the 
explicit computation of 4 is needed. Given the computational steps of the 
algorithm consist of a sequence of matrix-vector multiplications and vector 
additions where the matrices, except for 4 are either bidiagonal or 
diagonal block matrices. Multiplication of a vector by 4 * is equivalent to 
the solution of a symmetric, positive-definite, block tridiagonal system. 

The vector can be computed in 0(n) steps by using the N-E algorithm 
[3]. The matrix-vector multiplications with diagonal or bidiagonal block 
matrices can be performed in 0(n) steps. The solution of the block tridiagonal 
system can also be obtained in 0(n) steps by using block LDL factorization 
[33] in 0(n) steps. Therefore, the computational complexity of the serial CF 
algorithm is of 0(n). 

Note that, however, Eq. (54) is presented in a coordinate-free form. Hence, 
before its implementation the tensors and vectors involved in its computation 
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should be projected onto a suitable frame. The choice of optimal frame for 
projection of equations and other issues regarding efficient serial 
implementation of the CF algorithm are extensively discussed in [1], wherein 
the computation cost in terms of number of operations is also evaluated (see 
Table II). However, as can be seen, even with the most efficient schemes for 
serial implementation, the CF algorithm is significantly less efficient than 
the other algorithms for serial computation (see Table II). For large n, the 
A-BI algorithm is more efficient than the CF algorithm by a factor of about 
2.5 (in terms of the total number of operations) for serial computation. 
Obviously, for smaller n (say n<12), the CF algorithm is also significantly 

2 3 

less efficient than the other 0(n ) or 0(n ) algorithms. 

It should be mentioned that the explicit computation of JA 1 (though it is 

2 

not usually needed) can be performed In 0(n ) steps. To see this, note that in 

Eq. (41) the computation of the term A is equivalent to the solution of a 

block tridiagonal system with n right-hand sides which can be computed in 

0(n ) steps. M can then be explicitly computed by performing a matrix-matrix 

2 

multiplication and a matrix-matrix addition, each in 0(n ) steps. This leads 

2 -1 

to a total computational complexity of 0(n ) for explicit evaluation of JA 

IV. Hew Mass Matrix Factorizations for Computation of Closed-Chain Multibody 
Dynamic Systems 

In this section we briefly discuss the application of the new factorization 
of to the computation of dynamics of closed-chain systems. Our discussion 

follows the treatment of the problem as presented in [34-37], wherein it is 
shown that the main computational problems are the evaluation of the 
Operational Space Mass Matrix A [38,39] and its inverse A l . Note that the 
computation of A is also required for the task space dynamic control of single 
robot arms [38,39]. The matrices A 1 and A are defined as 

A = ) 1 e[R 6x6 and A 1 = 1 3 c(R 6x6 (55) 

where ^c!R 6x6n is the Jacobian matrix. In [34-37] recursive 0(n) algorithms are 
developed for computation of A 1 , and the matrix A is then computed by 
explicit inversion of A 1 . The main computational step in these algorithms is 
the computation of the articulated-body inertia as in Eq. (13). Therefore, as 
discussed in Sec. II. D, these 0(n) algorithms are also strictly serial. 

Here, we show that the new factorization of M 1 directly leads to new 
factorizations of both A 1 and A as well as new 0(n) algorithms for their 
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computation. These new factorizations are similar to that of M 1 since they 
can be described in terms of the Schur Complement and thus provide simple 
physical interpretation and a different and deeper insight into the problem. 
More importantly, however, the resulting algorithms can be parallelized to 
derive both time- and processor-optimal parallel algorithms, i.e., O(log n) 
parallel algorithms with 0(n) processors, for computation of A 1 and A. 

A. A New Factorization of the Inverse of Operational Space Mass Matrix A 1 


An operator expression of £ is derived in [34,35]. Using the notation of 
this paper, this operator expression is given as 

(56) 

6x6n 


} = 0(7>’)' 1 X 


where 0 = [P 0 0 ... 0]clR . From Eqs. (40) and (56), an operator 

n 

expression of A 1 is then derived as 

A -1 = 0 (ZM^XtxWVx - xW' 1 Z > W , (W’*Z > *5‘Vm”W*Z" 1 y , X}X*(? > r 1 0* 

= 0{ (7 , *)' 1 (KK*)? ) *(5' 1 - 9~ l f*W (W r W 1 ? > >0” WV 1 (7 5 )' 1 >0* 

The above expression can be simplified by noting that from Eq. (30), we have 


(57: 


XX = U - WW 

By inserting Eq. (58) into Eq. (57) and after some involved algebraic 
manipulations, a simple operator expression of A 1 is derived as 

A -1 = 02 _1 0* - 05"Vw(M f VVVin'VVV 1 0* 

This expression can be further simplified since 

a _ i 

Z) = 0? 0 -PIP - J 

n n n n,n+l 

6 * = 0Z‘Vlf = [P*j'V 0 0 ... 0] dR £ 


(58) 


(59) 

(60) 
(61) 


_6x5n 

U U . . . U J CD 

n n n 

This factorization of A -1 is then written in the form of the Schur Complement 


(62) 


a' 1 = d - e*/'e 

Note that the matrix A is the same as in Eq. (41). Let us define a matrix £' 
A el 


r £ 


& T> 


clR 


6nx6n 


(63) 


A -1 is then the Schur Complement of A in £' , i.e., A 1 = (£'/A). As in the 
previous section, based on the Schur Complement factorization of A and the 
structure of matrix £’ , a set of linear equations can be formed leading to 
both simple physical interpretation and alternative derivation of this 
factorization of A -1 (see [40] for a more detailed discussion). 


259 


From Eq. (62), A" 1 can be explicitly computed in 0(n) steps as follows. The 

term A X & can be computed in 0(n) steps since it is equivalent to the solution 

of a block tridiagonal system with six right-hand sides. A -1 is then computed 

by performing a matrix-matrix multiplication and a matrix-matrix addition, 

each with a cost of 0(1), leading to a total computation complexity of 0(n). 

However, usually the multiplication of A 1 by a vector, say F , rather than 

n+l 

the explicit computation of A 1 is needed. In this case, it is significantly 
more efficient to directly compute A V rather than first explicitly 

n+l 

compute A and then perform the matrix-vector multiplication. Note that the 
computation of A V is also done in 0(n) steps. 

n+l 


B. A New Factorization of Operational Space Hass Matrix A 


The operator expression of A is derived by using the matrix identity in 


Eq. (52) for inverting the matrix A 1 in Eq. (62) as 

a = (D - eV 1 #)' 1 = d" 1 - - 

(f'p* 

The factorization of A can be further simplified by noting that 

's = zf 1 = (02'Vr 1 = u' 1 r 1 = i 

n,n+l n,n+l 

p = [p*rV oo... o] 

n n n 

n* = (0? -1 /3 = t(P ) -1 J (P*) _1 P*J“V 0 0 ... 0] 

n n n n n n 


(64) 


(65) 

( 66 ) 
(67) 


= IP V 0 0 ... 0 ] 

n,n+l n 

?’ _1 = J~ X p* [p9~ 1 p*)~ 1 pj~ 1 - ■?’* = Diag{ I ’ 1 >clR 6nx6n (68) 

with /’ 1 = 0 and I’ 1 = -I 1 , i = n-1, n-2, .... 1 

n i I 

y- = w* T 3' "Vlf (69) 

Note that the matrix y is a rank one modification of matrix A in Eq. (41). The 
factorization of A is then written in terms of the Schur Complement as 

A = $ - ftV 1 ?? (70) 

Let us define a matrix £’ ’ as 


y 

« 

‘R 


§• 


elR 


6nx6n 


(71) 


A is then the Schur Complement of IP in J?” , i.e., A = (if’Vy). Again, based on 
the Schur Complement factorization of A 1 and the structure of matrix if”, a 
set of linear equations can be formed leading to both simple physical 
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interpretation and alternative derivation of this factorization of A 1 (see 
[40] for a more detailed discussion). 

As for A -1 , from Eq. (70) A can be explicitly computed in 0(n) steps since 

the term y’" 1 ?? can be computed as the solution of a block tridiagonal system 

with six right-hand sides. A is then computed by performing a matrix-matrix 

multiplication and a matrix-matrix addition, each with a cost of 0(1), leading 

to a total computation complexity of 0(n). However, usually the multiplication 

of A by a vector, say V , rather than the explicit computation of A is 

n+1 

needed. In this case, it is significantly more efficient to directly compute 
AF rather than first explicitly compute A 1 and then perform the matrix- 

n+1 

vector multiplication. Again, note that the computation of A^ +j is done in 
0(n) steps. 

V. Parallel 0(log n) Algorithms for Computation of Open- and Closed-Chain 
Rigid Multibody Systems 

The parallel implementation of the CF algorithm for a serial open-chain 
system is extensively discussed in [1]. Here, we briefly present the results 
of [1] and their extension to the computation of closed-chain systems. 

A. Parallel 0(log n) Algorithms for Open-Chain Rigid Multibody Systems 

The computation of the parallel CF algorithm is performed as follows. 

Step I. Projection and Computation of Matrix d 

The projection of vectors and tensors and the explicit computation of 
matrix d is performed in 0(1) steps with n processors. 

Step II. Computation of (F 

By using the algorithm in [20], f is computed in 0(log n)+0(l) steps 
with n processors. 

Step III. Computation of t ? = £*? t = and = hWVh? t 

The computation of E t and Q g involves two sequences of matrix-vector 
multiplication wherein the matrices are bidiagonal or diagonal block matrices 
and is performed in 0(1) steps with n processors. 

Step IV. Computation of = d *Z T 

The SPD block tridiagonal system is solved in 0(log n)+0(l) steps with n 
processors and by using the Odd-Even Elimination (0EE) variant of the cyclic 
reduction algorithm [41], 
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Step V. Computation of Q * B? = J( H 1 PW& and Q = Q + Q 

s s s ST 

The computation of Q 1 is similar to that of in Step III and is 
performed in 0(1) steps with n processors. 

As can be seen, the overall computational complexity of the parallel CF 
algorithm is of 0(log n)+0(l) by using n processors. In [1] it is shown that 
the algorithm can be efficiently implemented on an SIMD parallel architecture 
with n processors and with a Shuffle-Exchange augmented with Nearest-Neighbor 
(SENN) interconnection. The SENN interconnection allows a perfect mapping, 
i.e., with no topological variation, of the parallel algorithm since it 
perfectly matches the inherent communication structure of different steps of 
the algorithm and thus leads to minimum communication cost. In [1] the 
computation and communication cost of the parallel CF are evaluated as 
(732m+653a) flog^n] + (542m+439a ) and ( 134 [log 2 n] +49 )c, where m, a, and c stand 
for the cost of multiplication, addition, and communicating a single datum 
( [V| is the smallest integer greater than or equal to x). 

Note that the parallel algorithm, while achieving the time lower bound in 
the computation, remains highly compute-bound. The ratio of the computation 
cost over communication cost is greater than 10, which indicates that the 
parallel algorithm has a rather large grain size and thus can be efficiently 
implemented on commercially available MIMD parallel architectures. In fact, 
the parallel CF algorithm is currently being implemented on an MIMD Hypercube 
parallel architecture. Furthermore, the parallel CF algorithm allows the 
exploitation of parallelism at several computational levels. In [1] it is 
shown that a greater speedup in the computation can be achieved by exploiting 
a multilevel parallelism and implementing the algorithm on an architecture 
with 3n processors. 

B. Parallel 0(log n) Algorithms for Closed-Chain Rigid Multibody Systems 

1. Computation of A 1 and A V 

The explicit evaluation of matrix A, similar to Step I of Sec. V. A, can be 
performed in 0(1) steps with n processors. The term A 1 8 in Eq. (62) 
represents the solution of an SPD block tridiagonal system with 6 right-hand 
sides and can be computed in 0( log n)+ 0(1) steps with n processors by using 
the OEE variant of the cyclic reduction algorithm. A 1 is then computed by 
performing a matrix-matrix multiplication and a matrix-matrix addition wherein 
each operation can be performed in 0(1) steps with 0(1) processors. This leads 
to a computational complexity of 0( log n)+0(l) with n processors, which 
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indicates a both time- and processor-optimal parallel algorithm for evaluating 
A l . Similar results with greater computation efficiency can be obtained when 
evaluating A since the solution of an SPD block tridiagonal system with 

a single right-hand side is needed. 

2. Computation of A 1 and A V 

n+l 

The explicit evaluation of matrix similar to that of can be performed 
in 0(1) steps with n processors. The rest of the computation in Eq. (70) is 
similar to that in Eq. (62). Therefore, both A 1 and A can be computed 

n+1 

in 0(log n)+0(l) steps with n processors, which indicates both time- and 
processor-optimal parallel algorithms for the computations. Note, however, 
that, unlike the matrix j 4, which is always SPD, for some configurations, the 
matrix if can become singular [34-37], and thus special care should be taken in 
computation of Eq. (70). However, it should be mentioned that the new and 
simple factorization of both A 1 and A provides much better insight for the 
analysis of the singularity in the computation of A [40], 

VI. Discussion and Conclusion 

In this paper, we presented parallel 0(log n) algorithms for computation of 
open- and closed-chain rigid multibody dynamics. These parallel algorithms 
were derived from new 0(n) algorithms for the problem. These 0(n) algorithms 
are based on a new force-decomposition strategy which results in new 
factorizations of M \ A l , and A, presented in Eqs. (40), (62), and (70). 

Some important conceptual features of these new algorithms and their 
underlying factorizations can be summarized as follows. 

1. The factorizations of M l 9 A \ and A are very similar and can be described 
in terms of the Schur Complement. Due to this similarity, both serial and 
parallel algorithms involve similar computational steps. 

2. Compared to the previous factorization of mass matrix, which is based on a 
multiplicative decomposition of M 1 (see Eq. (23)), the new factorization 
leads to an additive decomposition of M' 1 which involves simpler matrices, 
i.e. , block tridiagonal matrices. 

3. Unlike the previous approaches, wherein A 1 is first recursively computed 
and then A is obtained by explicit inversion of A" 1 , independent 
factorizations for both A 1 and A are derived, which allows direct computation 
of either of them. 

From a computational point of view, the main feature of the new algorithms 
is that they are strictly parallel algorithms. That is, as was shown through 
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Algorithm 

Computation Cost 

Number of Processors 


A-BI 

586n - 371 

- 

Se r I a. 1 — 

CF 

1500n - 755 

- 


SL CF 

1385[log 2 n] + 981 

n 

Paral lei 

ML CF 

780 ["log^n] + 595 

3n 


0(n 3 ) 

6n+69[log 2 n'| + 340 

n(n+l )/2 


A-BI: Articulated-Body Inertia Algorithm. CF: Constraint Force Algorithm 
SL CF: Single Level Parallel CF. ML CF: Multilevel Parallel CF 
0(n 3 ): Parallel 0(n 3 ) algorithm in [2]. 

Table II. Computation Costs of Serial and Parallel Algorithms 

an extensive analysis in [1] for a single serial chain, the algorithm is less 
efficient than other 0(n) algorithms for serial computation (see Table II). 
However, based on the analysis in Sec. II. D, they are the only known 
algorithms that can be parallelized and lead to both time- and processor- 
optimal parallel algorithms for the problem. 

The computation costs of different serial and parallel algorithms for the 
problem are presented in Table II, wherein it is assumed that m = a. As can be 
seen, for small n, the parallel algorithm resulting from parallelization of 
the 0(n 3 ) algorithm with 0(n 2 ) processors is the most efficient. However, as n 
increases, so does the efficiency of the parallel O(log n) algorithms. 

As the last point, it should be emphasized that the parallel algorithms 
developed in this paper — in addition to being theoretically significant by 
proving, for the first time, the existence of both time- and processor-optimal 
parallel algorithms for the problem — are also highly practical from an 
implementation point of view. This is due to their large grain size and 
simple communication and processor interconnection requirements. In fact, 
these algorithms are currently being implemented on a Hypercube parallel 
architecture. 
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